A Suggested Revision to the
Forthcoming 5th Edition of the
APA Publication Manual
Effect size. Because p values are confounded, joint functions of several study features, including effect size and sample size, they are not useful indices of study effects. As emphasized by the APA Task Force on Statistical Inference (Wilkinson & APA Task Force on Statistical Inference, 1999), "reporting and interpreting effect sizes in the context of previously reported effects is essential to good research" (p. 599, emphasis added).
Reporting effect sizes has three important benefits. First, reporting effects facilitates subsequent meta-analyses incorporating a given report. Second, effect size reporting creates a literature in which subsequent researchers can formulate more specific study expectations by integrating the effects reported in related prior studies. Third, and perhaps most important, interpreting the effect sizes in a given study facilitates the evaluation of how a study's results fit into the existing literature, allows explicit assessment of how similar or dissimilar results are across related studies, and can inform judgments about which study features contributed to similarities or differences in effects.
For these reasons the 1994 fourth edition of the Publication Manual "encouraged" (p. 18) effect size reporting. However, 11 empirical studies of one or two post-1994 volumes of 23 journals found that this admonition had little, if any, impact (Vacha-Haase, Nilsson, Reetz, Lance & Thompson, 2000).
The reasons why the "encouragement" was ineffective, as reflected in the literature summary presented by Vacha-Haase et al. (2000), appear to be clear. As Thompson (1999) noted, only "encouraging" effect size reporting
presents a self-canceling mixed-message. To present an "encouragement" in the context of strict absolute standards regarding the esoterics of author note placement, pagination, and margins is to send the message, "these myriad requirements count, this encouragement doesn't." (p. 162)
Consequently, this edition of the Publication Manual incorporates as a requirement, "Always provide some effect-size estimate when reporting a p value" (Wilkinson & APA Task Force on Statistical Inference, 1999, p. 599, emphasis added).
In classical statistics, effect sizes characterize the fit of a model (e.g., a fixed-effects factorial ANOVA model) to data. Similarly, in structural equation modeling (SEM) goodness of fit indices may be thought of as effect sizes.
In a few analyses (e.g., randomization tests) effect size indices have not yet been formulated. However, confidence intervals are quite useful in these instances, just as they are even when effect sizes can be computed. Reporting confidence intervals, especially in direct comparison with the confidence intervals from related prior studies, falls squarely within the spirit of required effect size reporting. The graphic presentation of confidence intervals can be particularly helpful to readers.
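As a purely illustrative sketch of such a computation, the following Python fragment builds a confidence interval for a difference in two group means. The function name is hypothetical, and a large-sample normal approximation (±1.96 standard errors) stands in for the t critical value a complete analysis would use:

```python
from math import sqrt
from statistics import mean, stdev

def mean_diff_ci(group1, group2, z=1.96):
    # Normal-approximation 95% CI for a difference in two group means.
    # A complete analysis would use a t critical value; 1.96 is the
    # large-sample approximation, used here only for illustration.
    diff = mean(group1) - mean(group2)
    se = sqrt(stdev(group1) ** 2 / len(group1)
              + stdev(group2) ** 2 / len(group2))
    return diff - z * se, diff + z * se
```

Plotting such intervals side by side with intervals from related prior studies supports the comparative interpretation described above.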
Numerous effect sizes can be computed. Useful reviews of various choices are provided by Kirk (1996), Olejnik and Algina (2000), Rosenthal (1994), and Snyder and Lawson (1993). However, a brief review of the available choices may be useful. Although there is a class of effect sizes that Kirk (1996) labelled "miscellaneous" (e.g., the odds ratios that are so important in loglinear analyses), there are two major classes of effect sizes for parametric analyses.
The first class of effect sizes involves standardized mean differences. Effect sizes in this class include indices such as Glass' Δ, Hedges' g, and Cohen's d. For example, Glass' Δ is computed as the difference in the two means (i.e., experimental group mean minus control group mean) divided by the control group standard deviation, where the SD computation uses n - 1 in the denominator. When the study involves matched or repeated measures designs, the standardized difference is computed taking into account the correlation between measures (Dunlap, Cortina, Vaslow & Burke, 1996).
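As an illustrative sketch (the function names are hypothetical, not part of any standard), the standardized differences just described can be computed as:

```python
from math import sqrt
from statistics import mean, stdev

def glass_delta(experimental, control):
    # Glass' Delta: mean difference divided by the control-group SD,
    # with the SD computed using n - 1 (statistics.stdev does this).
    return (mean(experimental) - mean(control)) / stdev(control)

def cohens_d(group1, group2):
    # Cohen's d: mean difference divided by the pooled within-group SD.
    n1, n2 = len(group1), len(group2)
    pooled = sqrt(((n1 - 1) * stdev(group1) ** 2
                   + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled
```

The two indices agree when the group standard deviations are equal, and diverge as the control-group SD departs from the pooled SD.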
Of course, not all studies involve experiments or only a comparison of group means. Because all parametric analyses are part of one General Linear Model family, and are correlational, variance-accounted-for effect sizes can be computed in all studies, including both experimental and non-experimental studies. Effect sizes in this second class include indices such as r², R², and η². For example, for regression, R² can be computed as the sum-of-squares explained divided by the sum-of-squares total. Or, for a one-way ANOVA, η² is computed as the sum-of-squares explained divided by the sum-of-squares total.
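The one-way ANOVA computation just described can be sketched as follows (a purely illustrative Python fragment; the function name is hypothetical):

```python
from statistics import mean

def eta_squared(*groups):
    # Eta-squared for a one-way ANOVA:
    # sum-of-squares explained (between groups) / sum-of-squares total.
    scores = [x for g in groups for x in g]
    grand = mean(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total
```

The same explained-over-total ratio underlies R² in regression, which is one expression of the General Linear Model commonality noted below.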
The General Linear Model is a powerful heuristic device (cf. Cohen, 1968), as suggested by commonalities in variance-accounted-for effect size formulas. However, in many applications it is advisable to convert these indices to unsquared metrics, for reasons summarized elsewhere (cf. D'Andrade & Dart, 1990; Ozer, 1985). When measures have intrinsically meaningful non-arbitrary metrics, as occasionally occurs in psychology, unstandardized effect indices may be more useful than standardized differences or variance-accounted-for or r statistics (Judd, McClelland & Culhane, 1995).
The effect sizes in these two classes--standardized differences and r--can be transformed into each other's metrics. For example, a Cohen's d can be converted to an r using Cohen's (1988, p. 23) formula #2.2.6:
r = d / √(d² + 4)

When total sample size is small or group sizes are disparate, it is advisable to use a slightly more complicated but more precise formula elaborated by Aaron, Kromrey and Ferron (1998):

r = d / √(d² + (N² − 2N) / (n₁n₂))
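The two conversion formulas above can be sketched in Python (illustrative function names only; note that the adjusted form converges toward the simple form as equal group sizes grow large):

```python
from math import sqrt

def d_to_r(d):
    # Cohen's (1988) formula #2.2.6: r = d / sqrt(d^2 + 4).
    return d / sqrt(d ** 2 + 4)

def d_to_r_adjusted(d, n1, n2):
    # Aaron, Kromrey and Ferron's (1998) more precise form, which
    # accounts for total N and for disparate group sizes.
    total = n1 + n2
    return d / sqrt(d ** 2 + (total ** 2 - 2 * total) / (n1 * n2))
```

With n₁ = n₂ = 500, for example, the two functions agree to roughly two decimal places, while small or unbalanced groups produce a visible discrepancy.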
Or an r can be converted to a d using Friedman's (1968, p. 246) formula #6:
d = 2r / √(1 − r²)

In addition to choosing between standardized difference and variance-accounted-for (or r) effect sizes, researchers must choose between "uncorrected" and "corrected" effect sizes. Like people, each individual sample has its own personality, or variance that is unique to that given sample. The effect sizes computed for a sample are inflated by capitalizing on this "sampling error variance."
However, we know what factors contribute to sampling error variance. Samples have more sampling error variance when (a) sample sizes are smaller, (b) the number of observed variables is larger, and (c) the population effect size is smaller. Because we know what factors contribute to sampling error variance, we can estimate the amount of positive bias in a variance-accounted-for effect size, and then estimate a "shrunken" or "corrected" effect size with the estimated sampling error variance removed. The "corrected" variance-accounted-for effect sizes include indices such as "adjusted R²," Hays' ω², and Herzberg's R².
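As one concrete example of such a correction, the statistic most regression programs report as "adjusted R²" shrinks the sample R² as a function of sample size n and the number of predictors k; the sketch below is illustrative only:

```python
def adjusted_r_squared(r2, n, k):
    # One common "shrunken" estimate, reported as "adjusted R2" by
    # most regression programs: 1 - (1 - R^2)(n - 1) / (n - k - 1),
    # where n is the sample size and k the number of predictors.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Note that the shrinkage is larger when n is smaller, k is larger, or R² is smaller, mirroring the three sources of sampling error variance listed above.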
No one effect size is appropriate for all research situations. However, psychology as a field will be more fully informed by inquiry in which researchers report and interpret an effect size, whatever that index may be.
It should also be noted that Cohen (1988) provided rules of thumb for characterizing what effect sizes are small, medium, or large, as regards his impressions of the typicality of effects in the social sciences generally. However, he emphasized that the interpretation of effects requires the researcher to think more narrowly in terms of a specific area of inquiry. And the evaluation of effect sizes inherently requires an explicit researcher personal value judgment regarding the practical or clinical importance of the effects. Finally, it must be emphasized that if we mindlessly invoke Cohen's rules of thumb, contrary to his strong admonitions, in place of the equally mindless consultation of p value cutoffs such as .05 and .01, we are merely electing to be thoughtless in a new metric.